Sentiment analysis, also known as opinion mining, is the process of extracting and analyzing the emotions and attitudes expressed in text. It has gained significant attention in natural language processing in recent years because it offers insight into the subjective opinions of individuals or groups towards a particular topic or product. One area of application is song lyrics, which can reveal the emotions and themes expressed in songs across genres and cultures. Applying sentiment analysis to lyrics can help researchers and industry professionals better understand the emotional impact and cultural significance of music, as well as the social and political contexts in which it is created and consumed.

This project conducts sentiment analysis on a large dataset of lyrics in order to explore the emotional content and sentiment patterns of popular music. Unsupervised sentiment analysis is performed to answer the following questions of interest.

  • What is the prevailing sentiment of popular songs in various countries?

  • How do the sentiments expressed in song lyrics vary across different countries?

In [1]:
# IMPORT DEPENDENCIES
import time

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
import plotly.offline as pyo
import chart_studio
import chart_studio.plotly as py
import chart_studio.tools as tls
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from nrclex import NRCLex
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from textblob import TextBlob

pyo.init_notebook_mode()  # render Plotly figures inline in the notebook
nltk.download('vader_lexicon')  # lexicon required by SentimentIntensityAnalyzer
In [2]:
# READ DATA 
song_data = pd.read_csv('../Data/merged_finaltop100_revised.csv') 
# REMOVE ROWS WITH NULL VALUES  
song_data = song_data.dropna()
song_data.head() 
Out[2]:
Unnamed: 0 track_id artist_names track_name source rank weeks_on_chart streams country danceability ... duration_ms time_signature album_release_date lyrics lyrics_trans continent iso_alpha3 len_words_orig len_words_trans lyrics_clean
0 0 0yLdNVWF3Srea0uzk55zFn Miley Cyrus Flowers Columbia 1 5 124198 United Arab Emirates 0.707 ... 200455.0 4.0 2023-01-13 We were good, we were gold\nKinda dream that c... we were good we were gold kinda dream that can... Asia ARE 334 334 good gold dream sell right til build home watc...
1 1 1Qrg8KqiBpW07V7PNxwwwL SZA Kill Bill Top Dawg Entertainment/RCA Records 2 10 106927 United Arab Emirates 0.644 ... 153947.0 4.0 2022-12-08 I'm still a fan even though I was salty\nHate ... im still a fan even though i was salty hate to... Asia ARE 362 362 fan even though salty hate see broad know happ...
2 2 6AQbmUe0Qwf5PZnt4HmTXv PinkPantheress, Ice Spice Boy's a liar Pt. 2 Warner Records 3 2 83627 United Arab Emirates 0.696 ... 131013.0 4.0 2023-02-03 Take a look inside your heart\nIs there any ro... take a look inside your heart is there any roo... Asia ARE 372 372 take look inside heart room room would hold br...
3 3 0WtM2NBVQNNJLh6scP13H8 Rema, Selena Gomez Calm Down (with Selena Gomez) Mavin Records / Jonzing World 4 25 79714 United Arab Emirates 0.801 ... 239318.0 4.0 2022-08-25 Vibez\nOh, no\nAnother banger\nBaby, calm down... vibez oh no another banger baby calm down calm... Asia ARE 495 495 another banger baby calm calm girl body put he...
4 4 2dHHgzDwk4BJdRwy9uXhTO Metro Boomin, The Weeknd, 21 Savage Creepin' (with The Weeknd & 21 Savage) Republic Records 5 11 79488 United Arab Emirates 0.715 ... 221520.0 4.0 2022-12-02 Ooh, ooh-ooh\nOoh-ooh-ooh, ooh, ooh-ooh (Just ... ooh oohooh oohoohooh ooh oohooh just cant beli... Asia ARE 458 456 believe man want somebody say saw person kiss ...

5 rows × 30 columns

I. Sentiment Scores: VADER

To perform sentiment analysis on the lyrics, this project uses VADER (Valence Aware Dictionary and sEntiment Reasoner), a widely used rule-based tool built on a lexicon of words and phrases rated for positive or negative sentiment. Although VADER was designed with social media text in mind, it is routinely applied to other sources such as customer reviews, news articles, and song lyrics. Beyond individual words, its rules account for local context such as intensifiers and negations. For each text, VADER returns the proportions of negative (neg), neutral (neu), and positive (pos) content, together with a compound score between -1 (extremely negative) and 1 (extremely positive). Because it needs no training data, it is an efficient way to analyze large volumes of text. The following results are obtained by using VADER to analyze the sentiment of popular song lyrics.
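
One detail is worth keeping in mind when reading the compound scores below. In VADER's reference implementation, the compound score is the sum of the word-level valence scores squashed into the range -1 to 1 by x / sqrt(x² + α), with α = 15. The sketch below reimplements only that normalization step; the function name `normalize_compound` and the sample sums are illustrative assumptions, not part of this project's code.

```python
import math


def normalize_compound(valence_sum: float, alpha: float = 15.0) -> float:
    """Squash a summed valence score into [-1, 1], as VADER does for `compound`."""
    score = valence_sum / math.sqrt(valence_sum * valence_sum + alpha)
    # Clip defensively so rounding can never push the result outside [-1, 1].
    return max(-1.0, min(1.0, score))


# Hypothetical valence sums: a short phrase versus increasingly long lyrics.
for valence_sum in (1.9, 10.0, 50.0, -50.0):
    print(f"{valence_sum:>6}: {normalize_compound(valence_sum):+.4f}")
```

Because every sentiment-bearing word adds to the sum, a lyric of several hundred words with even a mild positive lean can end up very close to 1, which is likely why many per-song compound values in this notebook cluster near the extremes.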

In [4]:
def get_sentiment_1(text): 
    '''Return the VADER scores (neg, neu, pos, compound) for one text.'''
    sia = SentimentIntensityAnalyzer()  # instantiate sentiment analyzer object
    # new_words = {'good': 2.0, 'down': 2.0, 'normal': 2.0, 'well': 2.0}
    # sia.lexicon.update(new_words)  # optionally override lexicon entries
    sentiment_score = sia.polarity_scores(text)  # dict of sentiment scores for the text
    return sentiment_score 
In [5]:
def get_scores_1(lst): 
    '''get scores for each text'''
    scores_ls = [] 
    for i in lst: 
        score = get_sentiment_1(i)
        scores_ls.append(score)
    return scores_ls 
In [6]:
%%time 
# GET SENTIMENT SCORES FOR EVERY LYRICS 
scores_df = pd.DataFrame(get_scores_1(song_data['lyrics_clean']))
scores_df
CPU times: total: 48.9 s
Wall time: 50.2 s
Out[6]:
neg neu pos compound
0 0.017 0.249 0.734 0.9996
1 0.161 0.401 0.438 0.9953
2 0.208 0.446 0.346 0.9784
3 0.042 0.605 0.353 0.9970
4 0.084 0.654 0.261 0.9814
... ... ... ... ...
6804 0.088 0.718 0.195 0.9837
6805 0.139 0.747 0.114 -0.2774
6806 0.048 0.426 0.526 0.9991
6807 0.000 0.843 0.157 0.9961
6808 0.129 0.805 0.066 -0.8885

6809 rows × 4 columns

In [7]:
# ADD TO DATAFRAME 
dt = song_data[['country', 'continent', 'lyrics_clean']].reset_index() 
df_all = pd.concat([dt, scores_df], axis=1)
df_all
Out[7]:
index country continent lyrics_clean neg neu pos compound
0 0 United Arab Emirates Asia good gold dream sell right til build home watc... 0.017 0.249 0.734 0.9996
1 1 United Arab Emirates Asia fan even though salty hate see broad know happ... 0.161 0.401 0.438 0.9953
2 2 United Arab Emirates Asia take look inside heart room room would hold br... 0.208 0.446 0.346 0.9784
3 3 United Arab Emirates Asia another banger baby calm calm girl body put he... 0.042 0.605 0.353 0.9970
4 4 United Arab Emirates Asia believe man want somebody say saw person kiss ... 0.084 0.654 0.261 0.9814
... ... ... ... ... ... ... ... ...
6804 7295 South Africa Africa cook thing man get high fade pure way hullabal... 0.088 0.718 0.195 0.9837
6805 7296 South Africa Africa sweet love yeah didnt mean say didnt love tigh... 0.139 0.747 0.114 -0.2774
6806 7297 South Africa Africa first wisdom fear hear child piano first wisdo... 0.048 0.426 0.526 0.9991
6807 7298 South Africa Africa mother mother mother mother mother mother moth... 0.000 0.843 0.157 0.9961
6808 7299 South Africa Africa let dude know work go closet go break bone saw... 0.129 0.805 0.066 -0.8885

6809 rows × 8 columns

In [8]:
# COMPUTE AVERAGE SENTIMENT SCORES FOR EACH COUNTRY 
# Select the numeric score columns before averaging to avoid the numeric_only deprecation warning
grouped_df = df_all.groupby('country')[['neg','neu','pos','compound']].mean().reset_index()
grouped_df

Out[8]:
country neg neu pos compound
0 Argentina 0.163455 0.563202 0.273343 0.522222
1 Australia 0.141630 0.580580 0.277740 0.537319
2 Austria 0.147273 0.555970 0.296677 0.572676
3 Belarus 0.193768 0.543474 0.262789 0.331445
4 Belgium 0.146310 0.580670 0.272930 0.424486
... ... ... ... ... ...
68 United Arab Emirates 0.132875 0.589146 0.277885 0.570010
69 United Kingdom 0.146571 0.579173 0.274194 0.518237
70 Uruguay 0.158385 0.563583 0.278042 0.565343
71 Venezuela 0.149444 0.572525 0.278030 0.490106
72 Vietnam 0.141152 0.534359 0.324511 0.708450

73 rows × 5 columns

In [9]:
# PLOT 
pd.options.plotting.backend = "plotly"
grouped_df.plot.bar(y='country', x=['neg','neu','pos','compound'], 
                    title = 'Average sentiment scores by country',
                    template = 'plotly_dark')

Figure 4.1

II. Sentiment Scores: TextBlob

Another Python library used for sentiment analysis of the lyrics is TextBlob, which offers a simple and intuitive interface. Its default analyzer is, like VADER, lexicon-based (it builds on the pattern library); a Naive Bayes classifier is available as an alternative but is not used here. TextBlob looks up the words of the input in its lexicon, evaluates their polarity, and adjusts for modifiers such as intensifiers and negations. It returns a polarity score for the entire text, ranging from -1 (extremely negative) to 1 (extremely positive), and a subjectivity score, ranging from 0 (factual) to 1 (opinionated), that indicates the degree to which the text expresses an opinion.
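
To make the two scores concrete, the following is a simplified sketch of lexicon-based scoring in the spirit of TextBlob's default analyzer: each known word carries a (polarity, subjectivity) pair, and the text-level scores are averages over the matched words. The three-word lexicon and the function `score_text` are hypothetical, and the sketch leaves out the negation and intensifier handling that TextBlob performs.

```python
from typing import Dict, Tuple

# Hypothetical mini-lexicon: word -> (polarity in [-1, 1], subjectivity in [0, 1]).
TOY_LEXICON: Dict[str, Tuple[float, float]] = {
    "good": (0.7, 0.6),
    "sweet": (0.35, 0.65),
    "hate": (-0.8, 0.9),
}


def score_text(text: str) -> Dict[str, float]:
    """Average polarity and subjectivity over the words found in the lexicon."""
    matches = [TOY_LEXICON[w] for w in text.lower().split() if w in TOY_LEXICON]
    if not matches:
        # No sentiment-bearing words: treat the text as neutral and objective.
        return {"polarity": 0.0, "subjectivity": 0.0}
    polarity = sum(p for p, _ in matches) / len(matches)
    subjectivity = sum(s for _, s in matches) / len(matches)
    return {"polarity": polarity, "subjectivity": subjectivity}


print(score_text("good gold dream"))
print(score_text("sweet love hate"))
```

Under an averaging scheme like this one, long lyrics do not drift toward the extremes as more words are added, which is consistent with the modest polarity values reported below.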

In [15]:
def get_sentiment_2(text): 
    '''get sentiment scores of a text'''
    blob = TextBlob(text) #instantiate sentiment analyzer object
    sentiment_score = blob.sentiment #sentiment score of text 
    dic = {'polarity': sentiment_score.polarity, 'subjectivity': sentiment_score.subjectivity} 
    return dic
In [16]:
def get_scores_2(lst): 
    '''get scores for each text'''
    scores_ls = [] 
    for i in lst:  
        score = get_sentiment_2(i)
        scores_ls.append(score)
    return scores_ls 
In [17]:
%%time 
# GET SENTIMENT SCORES FOR EVERY LYRICS 
scores_df1 = pd.DataFrame(get_scores_2(song_data['lyrics_clean']))
df_all1 = pd.concat([df_all, scores_df1], axis=1)
df_all1
CPU times: total: 4.83 s
Wall time: 4.87 s
Out[17]:
index country continent lyrics_clean neg neu pos compound polarity subjectivity
0 0 United Arab Emirates Asia good gold dream sell right til build home watc... 0.017 0.249 0.734 0.9996 0.499696 0.549696
1 1 United Arab Emirates Asia fan even though salty hate see broad know happ... 0.161 0.401 0.438 0.9953 0.270764 0.394594
2 2 United Arab Emirates Asia take look inside heart room room would hold br... 0.208 0.446 0.346 0.9784 0.439015 0.563258
3 3 United Arab Emirates Asia another banger baby calm calm girl body put he... 0.042 0.605 0.353 0.9970 0.141603 0.504190
4 4 United Arab Emirates Asia believe man want somebody say saw person kiss ... 0.084 0.654 0.261 0.9814 0.152885 0.368269
... ... ... ... ... ... ... ... ... ... ...
6804 7295 South Africa Africa cook thing man get high fade pure way hullabal... 0.088 0.718 0.195 0.9837 0.163435 0.400835
6805 7296 South Africa Africa sweet love yeah didnt mean say didnt love tigh... 0.139 0.747 0.114 -0.2774 0.010605 0.311858
6806 7297 South Africa Africa first wisdom fear hear child piano first wisdo... 0.048 0.426 0.526 0.9991 0.500275 0.801515
6807 7298 South Africa Africa mother mother mother mother mother mother moth... 0.000 0.843 0.157 0.9961 0.200000 1.000000
6808 7299 South Africa Africa let dude know work go closet go break bone saw... 0.129 0.805 0.066 -0.8885 -0.221528 0.622917

6809 rows × 10 columns

In [18]:
# COMPUTE AVERAGE SCORES FOR EVERY COUNTRY
# Select the numeric score columns before averaging to avoid the numeric_only deprecation warning
score_cols = ['neg','neu','pos','compound','polarity','subjectivity']
grouped_df = df_all1.groupby('country')[score_cols].mean().reset_index()
grouped_df

Out[18]:
country neg neu pos compound polarity subjectivity
0 Argentina 0.163455 0.563202 0.273343 0.522222 0.104611 0.527113
1 Australia 0.141630 0.580580 0.277740 0.537319 0.144419 0.501962
2 Austria 0.147273 0.555970 0.296677 0.572676 0.119142 0.500247
3 Belarus 0.193768 0.543474 0.262789 0.331445 0.104841 0.513782
4 Belgium 0.146310 0.580670 0.272930 0.424486 0.128526 0.514742
... ... ... ... ... ... ... ...
68 United Arab Emirates 0.132875 0.589146 0.277885 0.570010 0.165916 0.491276
69 United Kingdom 0.146571 0.579173 0.274194 0.518237 0.115046 0.489386
70 Uruguay 0.158385 0.563583 0.278042 0.565343 0.110547 0.533040
71 Venezuela 0.149444 0.572525 0.278030 0.490106 0.109068 0.518744
72 Vietnam 0.141152 0.534359 0.324511 0.708450 0.163098 0.514426

73 rows × 7 columns

In [37]:
# PLOT 
pd.options.plotting.backend = "plotly"
grouped_df.plot.bar(y='country', x=['polarity','subjectivity'], 
                title = 'Average scores by country', template = 'plotly_dark')

Figure 4.2

In this plot, Pakistan's polarity bar is not visible due to its exceptionally low average polarity score of 0.000327.

III. Maps

In [23]:
# GET ISO-3 CODES FOR EACH COUNTRY 
iso = song_data[['country','continent', 'iso_alpha3']].drop_duplicates()
merged_df = pd.merge(grouped_df, iso, on='country')
merged_df
Out[23]:
country neg neu pos compound polarity subjectivity continent iso_alpha3
0 Argentina 0.163455 0.563202 0.273343 0.522222 0.104611 0.527113 South America ARG
1 Australia 0.141630 0.580580 0.277740 0.537319 0.144419 0.501962 Oceania AUS
2 Austria 0.147273 0.555970 0.296677 0.572676 0.119142 0.500247 Europe AUT
3 Belarus 0.193768 0.543474 0.262789 0.331445 0.104841 0.513782 Europe BLR
4 Belgium 0.146310 0.580670 0.272930 0.424486 0.128526 0.514742 Europe BEL
... ... ... ... ... ... ... ... ... ...
68 United Arab Emirates 0.132875 0.589146 0.277885 0.570010 0.165916 0.491276 Asia ARE
69 United Kingdom 0.146571 0.579173 0.274194 0.518237 0.115046 0.489386 Europe GBR
70 Uruguay 0.158385 0.563583 0.278042 0.565343 0.110547 0.533040 South America URY
71 Venezuela 0.149444 0.572525 0.278030 0.490106 0.109068 0.518744 South America VEN
72 Vietnam 0.141152 0.534359 0.324511 0.708450 0.163098 0.514426 Asia VNM

73 rows × 9 columns

In [43]:
# TOP AND BOTTOM 5 COMPOUND SCORES 
merged_df[['country', 'compound']].sort_values('compound',ascending =False)
Out[43]:
country compound
34 Japan 0.779608
59 South Korea 0.760777
26 Hong Kong 0.716759
72 Vietnam 0.708450
22 Germany 0.682694
... ... ...
54 Romania 0.337070
3 Belarus 0.331445
15 Dominican Republic 0.316231
67 Ukraine 0.299052
65 Turkey 0.232454

73 rows × 2 columns

In [24]:
# PLOT RESULTS 
fig = px.scatter_geo(merged_df, locations='iso_alpha3', color="compound",
                     hover_name="country", size="compound", 
                     title='Average sentiment score (compound) by country' )
fig.update_geos(showcoastlines=True, coastlinecolor="white",
                showocean=True, oceancolor="black")
fig.show()

Figure 4.3

The average compound score differs across countries. Japan (0.780), South Korea (0.761), Hong Kong (0.717), Vietnam (0.708), and Germany (0.683) have the highest averages, which suggests that the songs charting in these countries carry predominantly positive lyrics. Conversely, Turkey (0.232) and Ukraine (0.299) have the lowest averages, although both remain above zero. The low score for Ukraine may be related to the ongoing war in the country, and other political or social issues may play a role in Turkey, but these explanations are speculative: the analysis measures lyric sentiment only and cannot establish why listeners favor particular songs.

In [44]:
# TOP AND BOTTOM 5 POLARITY SCORES 
merged_df[['country', 'polarity']].sort_values('polarity',ascending =False)
Out[44]:
country polarity
30 Indonesia 0.191214
39 Malaysia 0.174983
64 Thailand 0.168123
68 United Arab Emirates 0.165916
72 Vietnam 0.163098
... ... ...
23 Greece 0.078557
15 Dominican Republic 0.063725
65 Turkey 0.055759
29 India 0.052369
47 Pakistan 0.000327

73 rows × 2 columns

In [25]:
# PLOT RESULTS 
fig1 = px.scatter_geo(merged_df, locations='iso_alpha3', color="polarity",
                     hover_name="country", size="polarity", 
                     title='Average sentiment score (polarity) by country' )
fig1.update_geos(showcoastlines=True, coastlinecolor="white",
                 showocean=True, oceancolor="black")
fig1.show()

Figure 4.4

As with the average compound scores in Figure 4.3, the average polarity scores vary across countries, though within a much narrower band (roughly 0.00 to 0.19). In Figure 4.2 above, Pakistan's polarity bar is not visible because its average polarity score is only 0.000327. Other countries with low TextBlob polarity scores include India, Turkey, the Dominican Republic, and Greece. Conversely, Indonesia, Malaysia, Thailand, the United Arab Emirates, and Vietnam have the highest average polarity scores.
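
The two tools do not rank countries identically, and a rank correlation is one way to quantify how far they agree. The sketch below computes Spearman's rho with the standard library, using only the ten country means displayed in Out[23]; it is therefore illustrative and says nothing definitive about all 73 countries. The helper names `ranks` and `spearman` are assumptions introduced for this example, and the formula used is valid only when there are no ties.

```python
from typing import Dict, List, Tuple

# (compound, polarity) country means copied from the ten rows displayed in Out[23].
MEANS: Dict[str, Tuple[float, float]] = {
    "Argentina": (0.522222, 0.104611),
    "Australia": (0.537319, 0.144419),
    "Austria": (0.572676, 0.119142),
    "Belarus": (0.331445, 0.104841),
    "Belgium": (0.424486, 0.128526),
    "United Arab Emirates": (0.570010, 0.165916),
    "United Kingdom": (0.518237, 0.115046),
    "Uruguay": (0.565343, 0.110547),
    "Venezuela": (0.490106, 0.109068),
    "Vietnam": (0.708450, 0.163098),
}


def ranks(values: List[float]) -> List[int]:
    """Rank values from 1 (smallest) to n; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(xs: List[float], ys: List[float]) -> float:
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid without ties."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))


compound = [c for c, _ in MEANS.values()]
polarity = [p for _, p in MEANS.values()]
rho = spearman(compound, polarity)
print(f"Spearman rho on {len(MEANS)} displayed countries: {rho:.3f}")
```

On the full table, the same statistic should be obtainable with pandas via `merged_df[['compound', 'polarity']].corr(method='spearman')`.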

In [42]:
# TOP AND BOTTOM 5 SUBJECTIVITY SCORES 
merged_df[['country', 'subjectivity']].sort_values('subjectivity',ascending =False)
Out[42]:
country subjectivity
53 Portugal 0.541602
9 Chile 0.540034
6 Brazil 0.538437
64 Thailand 0.537607
48 Panama 0.537318
... ... ...
51 Philippines 0.481905
55 Saudi Arabia 0.480588
23 Greece 0.473276
45 Nigeria 0.470287
29 India 0.460736

73 rows × 2 columns

In [26]:
# PLOT RESULTS 
fig1 = px.scatter_geo(merged_df, locations='iso_alpha3', color="subjectivity",
                     hover_name="country", size="subjectivity", 
                     title='Average subjectivity by country' )
fig1.update_geos(showcoastlines=True, coastlinecolor="white",
                 showocean=True, oceancolor="black")
fig1.show()

Figure 4.5

As for subjectivity, Portugal, Chile, Brazil, Thailand, and Panama have the highest average subjectivity scores among the countries analyzed, which suggests that songs popular in these countries contain somewhat more subjective language and expressions of personal feelings or beliefs. In contrast, the Philippines, Saudi Arabia, Greece, Nigeria, and India have the lowest scores, indicating lyrics that TextBlob rates as slightly more objective. The spread is narrow, however (about 0.46 to 0.54), so every country's average sits near the midpoint of the scale and the differences should be read cautiously.

IV. Emotions: NRCLex

The NRC Emotion Lexicon offers another lexicon-based approach, one that focuses on identifying and quantifying the emotional content of text. The lexicon was developed at the National Research Council of Canada, and NRCLex is a Python package built on it. NRCLex assigns a score to each of eight basic emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, and trust) as well as to two broader sentiments (negative and positive). The package covers a vocabulary of roughly 27,000 English words and their associations with each emotion, and the underlying lexicon has been extended to other languages. One advantage of this approach is that it distinguishes between emotions rather than reporting a single polarity, which makes it useful for understanding how individuals feel about a particular topic or product. However, its effectiveness may be limited when a text contains sarcasm, irony, or other forms of indirect speech. For this project, NRCLex is used to identify how the popular song lyrics are distributed across emotions.

In [28]:
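def get_emotion(ls): 
    '''Return the dominant NRCLex affect label for each text.'''
    emotions = []
    for text in ls:
        text_obj = NRCLex(text)
        # print(text_obj.raw_emotion_scores)  # raw counts per affect, for debugging
        freqs = text_obj.affect_frequencies  # relative frequency of each affect
        # Keep the label with the highest frequency (the first one wins on ties)
        emotions.append(max(freqs, key=freqs.get))
    return emotions 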
In [29]:
%%time
# GET EMOTIONS 
em = get_emotion(song_data['lyrics_clean'])
len(em)
CPU times: total: 4.91 s
Wall time: 4.94 s
Out[29]:
6809
In [30]:
# ADD EMOTIONS TO DATAFRAME 
df_all1['emotions'] = em
df_all1.head() 
Out[30]:
index country continent lyrics_clean neg neu pos compound polarity subjectivity emotions
0 0 United Arab Emirates Asia good gold dream sell right til build home watc... 0.017 0.249 0.734 0.9996 0.499696 0.549696 positive
1 1 United Arab Emirates Asia fan even though salty hate see broad know happ... 0.161 0.401 0.438 0.9953 0.270764 0.394594 positive
2 2 United Arab Emirates Asia take look inside heart room room would hold br... 0.208 0.446 0.346 0.9784 0.439015 0.563258 negative
3 3 United Arab Emirates Asia another banger baby calm calm girl body put he... 0.042 0.605 0.353 0.9970 0.141603 0.504190 positive
4 4 United Arab Emirates Asia believe man want somebody say saw person kiss ... 0.084 0.654 0.261 0.9814 0.152885 0.368269 positive
In [31]:
# PLOT RESULTS 
fig = px.histogram(df_all1, x="emotions", color = 'emotions', 
                   color_discrete_sequence=px.colors.qualitative.Dark2,
                  title = 'Emotion-wise distribution of lyrics', template = 'plotly_dark')
fig.show()

Figure 4.6

This graph shows that popular songs predominantly express positive emotions, while relatively few lyrics are dominated by anger. The distribution is a useful starting point for inferences about the collective mood of music listeners, although linking it to global social or political issues would require further evidence.
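
Because `affect_frequencies` reports the two broad sentiments (positive and negative) alongside the eight emotions, the broad categories can crowd out the specific emotions when only the single largest value is kept. A possible refinement, sketched below with a hypothetical frequency dictionary rather than real NRCLex output, is to pick the dominant label among the eight emotions only. The name `dominant_emotion` is an assumption introduced for this example.

```python
from typing import Dict

EMOTIONS = ("anger", "anticipation", "disgust", "fear",
            "joy", "sadness", "surprise", "trust")


def dominant_emotion(freqs: Dict[str, float]) -> str:
    """Return the highest-scoring of the eight emotions, ignoring positive/negative."""
    # Missing keys count as 0.0; ties resolve to the first emotion in EMOTIONS.
    return max(EMOTIONS, key=lambda name: freqs.get(name, 0.0))


# Hypothetical frequencies in the shape of an affect-frequency dictionary.
toy_freqs = {"positive": 0.30, "negative": 0.10, "joy": 0.25,
             "trust": 0.20, "sadness": 0.15}

print(max(toy_freqs, key=toy_freqs.get))  # label picked when all ten categories compete
print(dominant_emotion(toy_freqs))        # label among the eight emotions only
```

Whether to keep or drop the two broad categories is a design choice: keeping them answers "is this song mostly positive or negative?", while dropping them gives a finer-grained view of which emotion stands out.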

In [45]:
# PLOT EMOTIONS BY REGION 
fig = px.histogram(df_all1, x="emotions", color = 'continent', 
                   color_discrete_sequence=px.colors.qualitative.Pastel, 
                  title = 'Regional lyrics emotions', template = 'plotly_dark')
fig.show()

Figure 4.7

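It appears that most of the songs with positive emotions are popular in European and Asian countries. Note that the histogram shows raw counts, so continents that contribute more songs to the dataset will naturally show more songs in every category.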
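
One way to control for the different number of songs per continent is to compare within-continent shares instead of counts. The sketch below does this on a handful of made-up (continent, emotion) pairs; the function `emotion_shares` and the sample rows are hypothetical and are not taken from the dataset.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def emotion_shares(rows: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Map each continent to the fraction of its songs carrying each label."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for continent, emotion in rows:
        counts[continent][emotion] += 1
    return {
        continent: {label: n / sum(c.values()) for label, n in c.items()}
        for continent, c in counts.items()
    }


# Hypothetical rows, not taken from the dataset.
toy_rows = [
    ("Europe", "positive"), ("Europe", "positive"), ("Europe", "negative"),
    ("Europe", "joy"), ("Africa", "positive"), ("Africa", "negative"),
]
for continent, shares in emotion_shares(toy_rows).items():
    print(continent, shares)
```

In this toy example Europe has twice as many positive songs as Africa, yet the positive share is the same (0.5) in both, which is the kind of difference that raw counts can hide.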